Abstract

We study the convergence of Z-estimators $\hat\theta(\eta) \in \mathbb{R}^p$ for which the objective function
depends on a parameter $\eta$ that belongs to a Banach space $\mathcal{H}$. Our results include the uniform
consistency over $\mathcal{H}$ and the weak convergence in the space of bounded $\mathbb{R}^p$-valued functions
defined on $\mathcal{H}$. Furthermore, when $\eta$ is a tuning parameter optimally selected at $\eta_0$, we
provide conditions under which an estimated $\hat\eta$ can be replaced by $\eta_0$ without affecting the
asymptotic variance. Interestingly, these conditions are free from any rate of convergence of
$\hat\eta$ to $\eta_0$ but they require the space described by $\hat\eta$ to be not too large. We highlight several
applications of our results and we study in detail the case where $\eta$ is the weight function in
weighted regression.
1 Introduction

Let $P$ denote a probability measure defined on a measurable space $(\mathcal{Z}, \mathcal{A})$ and let $(Z_1, \ldots, Z_n)$
be independent and identically distributed random elements with law $P$. Given a measurable
function $f : \mathcal{Z} \to \mathbb{R}$, we define

$$Pf = \int f\,dP, \qquad \mathbb{P}_n f = n^{-1}\sum_{i=1}^n f(Z_i), \qquad \mathbb{G}_n f = n^{1/2}(\mathbb{P}_n - P)f,$$

where $\mathbb{G}_n$ is called the empirical process. Considering the estimation of a Euclidean parameter
$\theta_0 \in \Theta \subset \mathbb{R}^p$, let $(\mathcal{H}, \|\cdot\|)$ denote a Banach space and let $\{\hat\theta(\eta) : \eta \in \mathcal{H}\}$ be a collection
of estimators of $\theta_0$ based on the sample $(Z_1, \ldots, Z_n)$. Suppose furthermore that there exists
$\eta_0 \in \mathcal{H}$ such that $\hat\theta(\eta_0)$ is efficient within the collection, i.e., $\hat\theta(\eta_0)$ has the smallest asymptotic
variance among the estimators of the collection. Such a situation arises in many fields of
statistics. For instance, $\eta$ can be the cut-off parameter in Huber robust regression, or $\eta$ might
as well be equal to the weight function in heteroscedastic regression (see the next section for
more details and examples). Unfortunately, $\eta_0$ is generally unknown since it depends
on the model $P$. Usually, one has to first estimate $\eta_0$ by, say, $\hat\eta$ and then compute the
estimator $\hat\theta(\hat\eta)$, which should result in a reasonable approximation of $\theta_0$. It turns out that in
many different situations $\hat\theta(\hat\eta)$ actually achieves the efficiency bound of the collection (see for
instance Newey and McFadden (1994), page 2164, and the references therein, or van der Vaart
(1998), page 61). This is all the more surprising since the accuracy of $\hat\eta$ as an estimator of $\eta_0$
does not matter, provided that $\hat\eta$ is consistent.
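The empirical measure and the empirical process introduced at the start of this section can be computed directly from a sample. The following is a minimal sketch, assuming a standard normal $P$ and the test function $f(z) = z^2$ (so that $Pf = 1$); the function names and the simulated data are illustrative and are not taken from the paper.

```python
import numpy as np

def empirical_mean(f, sample):
    """P_n f: the average of f over the observed sample."""
    return float(np.mean(f(sample)))

def empirical_process(f, sample, true_mean):
    """G_n f = sqrt(n) * (P_n f - P f); the caller supplies P f."""
    n = len(sample)
    return float(np.sqrt(n) * (empirical_mean(f, sample) - true_mean))

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000)        # Z_1, ..., Z_n drawn from P = N(0, 1) (assumption)
f = lambda x: x ** 2                   # P f = 1 under N(0, 1)

pn_f = empirical_mean(f, z)
gn_f = empirical_process(f, z, true_mean=1.0)
```

Here `pn_f` should be close to 1, while `gn_f` stays of order 1 as the sample size grows, which is the scale at which the weak convergence results of the paper are stated.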
A paradigm that encompasses the previous facts can be developed via the stochastic equicontinuity
of the underlying empirical process. Define the process $\eta \mapsto \mathbb{Z}_n(\eta) = \sqrt{n}(\hat\theta(\eta) - \theta_0)$ and
assume that it lies in $\ell^\infty(\mathcal{H})$, the space of bounded $\mathbb{R}^p$-valued functions defined on $\mathcal{H}$. It is
stochastically equicontinuous on $\mathcal{H}$ if for any $\epsilon > 0$,

$$\lim_{\delta \to 0} \limsup_{n \to +\infty} P\Big( \sup_{\|\eta_1 - \eta_2\| < \delta} |\mathbb{Z}_n(\eta_1) - \mathbb{Z}_n(\eta_2)| > \epsilon \Big) = 0, \qquad (1)$$

where $|\cdot|$ stands for the Euclidean norm. Clearly, $\hat\theta(\hat\eta)$ is efficient whenever $\sqrt{n}(\hat\theta(\hat\eta) - \hat\theta(\eta_0)) =
\mathbb{Z}_n(\hat\eta) - \mathbb{Z}_n(\eta_0)$ goes to 0 in probability. This holds true if, in addition to (1), we have that

$$P(\hat\eta \in \mathcal{H}) \to 1 \quad \text{and} \quad \|\hat\eta - \eta_0\| \to 0 \ \text{in probability}. \qquad (2)$$

In other words, stochastic equicontinuity allows for a “no rate” condition on $\hat\eta$ given in (2). In
fact, conditions (1) and (2) represent a trade-off to be resolved when selecting the norm
$\|\cdot\|$. While one prefers to have $\|\cdot\|$ as weak as possible in order to prove (2), one needs the
metric to be strong enough so that (1) can hold. In many statistical problems, one has

$$\sqrt{n}(\hat\theta(\eta) - \theta_0) = \mathbb{G}_n \varphi_\eta + o_P(1),$$

where the $o_P(1)$ is uniform over $\eta \in \mathcal{H}$ and $\varphi_\eta$ is a measurable function often called the
influence function. This asymptotic decomposition makes it possible to use empirical process theory
and, in particular, the Donsker property in order to show (1) with $\|\cdot\|$ being the $L_2(P)$-norm.
As summarized in van der Vaart and Wellner (1996), sufficient conditions deal with the
metric entropy of the class of functions $\{\varphi_\eta : \eta \in \mathcal{H}\}$, which needs to be small enough.

The main purpose of the paper is to establish general conditions for the efficiency of Z-estimators
for which the objective function depends on some $\eta \in \mathcal{H}$. More formally, we consider
$\theta_0$ and $\hat\theta(\eta)$ defined, respectively, as “zeros” of the maps

$$\theta \mapsto P\psi_\eta(\theta) \quad \text{and} \quad \theta \mapsto \mathbb{P}_n\psi_\eta(\theta),$$

where for each $\theta \in \Theta$ and $\eta \in \mathcal{H}$, $\psi_\eta(\theta)$ is an $\mathbb{R}^p$-valued measurable map defined on $\mathcal{Z}$. Since for
every $\eta \in \mathcal{H}$, $P\psi_\eta(\theta_0) = 0$, we have several (possibly infinitely many) equations that characterize
$\theta_0$. Hence $\eta$ is better understood as a tuning parameter than as a classical nuisance
parameter: $\eta$ has no effect on the consistency of $\hat\theta(\eta)$ but it influences the asymptotic variance.
In Newey (1994), similar semiparametric estimators are studied using pathwise derivatives along
sub-models and the author underlines that, for such models, “different nonparametric estimators
of the same functions should result in the same asymptotic variance” (Newey, 1994, page 1356).
In Andrews (1994), the previous statement is formally demonstrated by relying on stochastic
equicontinuity, as detailed in (1) and (2). In this paper, we provide new conditions on the
map $(\theta, \eta) \mapsto \psi_\eta(\theta)$ and the estimators $\hat\eta$ under which the following statement holds: $\hat\theta(\hat\eta)$
has the same asymptotic law as $\hat\theta(\eta_0)$. Despite considering slightly less general estimators
than in Andrews (1994), our approach alleviates the regularity conditions imposed on the map
$\theta \mapsto \psi_\eta(\theta)$. They are replaced by weaker regularity conditions dealing with the map $\theta \mapsto P\psi_\eta(\theta)$.
Moreover, we focus on conditional moment restrictions models in which $\eta$ is a weight function.
In this context, our approach results in a very simple condition on the metric entropy generated
by $\hat\eta$, which is shown to be satisfied for classical Nadaraya-Watson estimators.

Our study is based on the weak convergence of $\{\sqrt{n}(\hat\theta(\eta) - \theta_0)\}_{\eta \in \mathcal{H}}$ as a stochastic process
belonging to $\ell^\infty(\mathcal{H})$. To the best of our knowledge, this result is new. The tools we use in
the proofs are reminiscent of the Z-estimation literature, for which we mention several relevant
contributions. In the case where $\theta_0$ is Euclidean, asymptotic normality is obtained in Huber
(1967). Nonsmooth objective functions are considered in Pollard (1985). In the case where $\theta_0$ is
infinite dimensional, weak convergence is established in van der Vaart (1995). The presence of a
nuisance parameter with possibly slower than root-$n$ rates of convergence is studied in Newey
(1994). Nonsmooth objective functions are investigated in Chen et al. (2003). Relevant summaries
can be found in the books Newey and McFadden (1994), van der Vaart and Wellner
(1996), van der Vaart (1998).

Among the different applications we have in mind, we focus on weighted regression for
heteroscedastic models. Even though this topic is quite well understood (see among others,
Robinson (1987), Carroll et al. (1988) and the references therein), it represents an interesting
example to apply our results. In linear regression, given the data $(Y_i, X_i)_{i=1,\ldots,n}$, where, for each
$i = 1, \ldots, n$, $Y_i \in \mathbb{R}$ is the response and $X_i \in \mathbb{R}^q$ are the covariates, it consists of computing

$$\hat\beta(w) = \operatorname{argmin}_{\beta} \sum_{i=1}^n (Y_i - \beta^T X_i)^2 w(X_i), \qquad (3)$$

where $w$ denotes a real valued function. Among such a collection of estimators, there exists an
efficient member $\hat\beta(w_0)$ (see Section 4.1 for details). Many studies have focused on the estimation
of $w_0$. For instance, Carroll and Ruppert (1982) argue that a parametric estimation of $w_0$ can
be performed, and Carroll (1982) and Robinson (1987) use different nonparametric estimators
to approximate $w_0$. The estimators $\hat\beta(\hat w)$ are shown to be efficient by relying on U-statistics-based
decompositions. This involves relatively long and peculiar calculations that depend on both
$\hat w$ and the loss function. Our approach overcomes this issue by providing high-level conditions
on $\hat w$ that are in some ways independent from the rest of the problem. In summary we require
that $\hat w(x) \to w_0(x)$ in probability, $dP(x)$-almost everywhere, and that there exists a function
space $\mathcal{W}$ such that

$$P(\hat w \in \mathcal{W}) \to 1 \quad \text{and} \quad \int_0^{+\infty} \sqrt{\log \mathcal{N}_{[\,]}(\epsilon, \mathcal{W}, L_r(P))}\, d\epsilon < +\infty,$$

for some $r > 2$, where $\mathcal{N}_{[\,]}$ denotes the bracketing numbers as defined in van der Vaart and Wellner
(1996). When $w_0$ is modelled parametrically, the previous conditions are fairly easy to verify.
For nonparametric estimators of $w_0$, in particular for Nadaraya-Watson estimators, smoothness
restrictions on $\mathcal{W}$ with respect to the dimension $q$ make it possible to obtain sufficiently sharp bounds on
the bracketing number of $\mathcal{W}$.

The paper is organised as follows. We describe in Section 2 several examples of estimators
for which the efficiency might follow from our approach. Section 3 contains the theoretical
background of the paper. We study the consistency and the weak convergence of $\hat\theta(\eta)$, $\eta \in \mathcal{H}$.
Based on this, we obtain conditions for the efficiency of $\hat\theta(\hat\eta)$. At the end of Section 3, we
consider weighted estimators for conditional moment restrictions models. In Section 4, we are
concerned with the estimation of the optimal weight function $w_0$ in weighted linear regression.
We investigate different approaches, from the parametric to the fully nonparametric. In Section
5, we evaluate the finite sample performance of several methods by means of simulations.

2 Examples 

As discussed in the introduction, the results of the paper make it possible to obtain the efficiency of
estimators that depend on a tuning parameter. This occurs at different levels of statistical
theory. We give several examples in the following.

Example 1. (Least-squares constrained estimation) Given $\hat\theta$ an arbitrary but consistent estimator
of $\theta_0$, the estimator $\hat\theta_c$ is said to be a least-squares constrained estimator if it minimizes
$(\hat\theta - \theta)^T \Gamma (\hat\theta - \theta)$ over $\theta \in \Theta$, where $\Gamma$ is lying over the set of symmetric positive definite
matrices such that $\Gamma > \delta > 0$. Consequently $\hat\theta_c$ depends on the choice of $\Gamma$ but since
$|\hat\theta_c - \hat\theta|^2 \le \delta^{-1}|\Gamma^{1/2}(\hat\theta_c - \hat\theta)|^2 \le \delta^{-1}|\Gamma^{1/2}(\theta_0 - \hat\theta)|^2 \to 0$ in probability, the matrix $\Gamma$ does not
affect the consistency of $\hat\theta_c$ estimating $\theta_0$. It is well known that $\hat\theta_c$ is efficient whenever $\Gamma$ equals
the inverse of the asymptotic variance of $\hat\theta$ (Newey and McFadden, 1994, Section 5.2). Such a
class is popular among econometricians and known as minimum distance estimators.

In the above illustrative example, the use of the asymptotic equicontinuity of the process
$\Gamma \mapsto \sqrt{n}(\hat\theta_c - \theta_0)$ is not really necessary since we could obtain the efficiency using more basic
tools such as Slutsky’s lemma in Euclidean space. This is due of course to the fact that
$\theta$ and $\Gamma$ are Euclidean, but also to the simplicity of the map $(\theta, \Gamma) \mapsto (\hat\theta - \theta)^T \Gamma (\hat\theta - \theta)$. Consequently,
we highlight below more involved examples in which either the tuning parameter is a function
(Examples 2, 4 and 5) or the dependence structure between $\theta$ and $\eta$ is complicated (Example
3). To our knowledge, the efficiency in the examples below is quite difficult to obtain.

Example 2. (weighted regression) This includes estimators described by (3) but other losses
than the square function might be used to account for the distribution of the noise. Examples are
$L_p$-losses, Huber robust loss (see Example 3 for details), least-absolute deviation and quantile
losses. In a general framework covering all of the latter examples, a formula for the optimal
weight function is established in Bates and White (1993).

Example 3. (Huber cut-off) Whereas weighted regression handles heteroscedasticity in the
data, the cut-off in Huber regression carries out the adaptation to the distribution of the noise
(Huber, 1967). The Huber objective function is the continuous function that coincides with the
identity on $[-c, c]$ ($c$ is called the cut-off) and is constant elsewhere. A Z-estimator based on
this function makes it possible to handle heavy tails in the distribution of the noise. The cut-off
can be chosen by minimizing the asymptotic variance.
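To make Example 3 concrete, the following sketch computes a Huber Z-estimator of a location parameter and selects the cut-off by minimizing a plug-in estimate of the asymptotic variance over a grid. It is only an illustration under stated assumptions: the location model, the simulated Student noise, the pilot estimator (the median), the grid of cut-offs and all function names are hypothetical and not taken from the paper. The variance formula $E\psi_c(e)^2 / P(|e| \le c)^2$ is the classical one for Huber location estimators with symmetric noise.

```python
import numpy as np

def huber_psi(x, c):
    """Huber score: identity on [-c, c], constant (equal to -c or c) outside."""
    return np.clip(x, -c, c)

def huber_location(sample, c, tol=1e-10):
    """Z-estimator of location: zero of theta -> mean(psi_c(sample - theta)).

    The map is non-increasing in theta, so bisection on [min, max] is enough.
    """
    lo, hi = float(np.min(sample)), float(np.max(sample))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.mean(huber_psi(sample - mid, c)) > 0:
            lo = mid   # score still positive: the zero lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def plugin_variance(residuals, c):
    """Plug-in estimate of E[psi_c(e)^2] / P(|e| <= c)^2."""
    numerator = np.mean(huber_psi(residuals, c) ** 2)
    denominator = np.mean(np.abs(residuals) <= c) ** 2
    return numerator / denominator

rng = np.random.default_rng(1)
sample = 2.0 + rng.standard_t(df=3, size=5000)   # heavy-tailed noise around 2 (assumption)
pilot = np.median(sample)                         # consistent first-step estimate
grid = np.linspace(0.5, 3.0, 26)                  # hypothetical candidate cut-offs
c_hat = grid[np.argmin([plugin_variance(sample - pilot, c) for c in grid])]
theta_hat = huber_location(sample, c_hat)
```

In the notation of the paper, `c_hat` plays the role of the estimated tuning parameter and `theta_hat` that of the Z-estimator evaluated at it.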

Example 4. (instrumental variable) In Newey (1990), the class of nonlinear instrumental variables
is defined through the generalized method of moments. The estimator $\hat\theta$ depends on a
so-called matrix of instruments $W$, and satisfies the equation $\sum_{i=1}^n W(\tilde Z_i)\varphi(Z_i, \hat\theta) = 0$, where
each $\tilde Z_i$ is some set of coordinates of $Z_i$ and $\varphi$ is a given function. A formula for the optimal
matrix of instruments is available.

Example 5. (dimension reduction) Li (1991) introduced sliced inverse regression in which
the vector $E[X\psi(Y)]$, when $\psi$ varies, describes a subspace of interest. A minimization of the
asymptotic variance leads to an optimal $\psi$ that can be estimated (Portier and Delyon, 2013).

3 Uniform Z-estimation theory 

Define $(\mathcal{Z}^\infty, \mathcal{A}_\infty, P_\infty)$ as the probability space associated to the whole sequence $(Z_1, Z_2, \ldots)$.
Random elements in $\ell^\infty(\mathcal{H})$ such as $\eta \mapsto \mathbb{G}_n\psi_\eta(\theta)$, are not necessarily measurable. To account
for this, we introduce the outer probability $P_\infty^o$ (see the introduction of van der Vaart and Wellner
(1996) for the definition). Each convergence, in probability or in distribution, will be stated
with respect to the outer probability. A class of functions $\mathcal{F}$ is said to be Glivenko-Cantelli if
$\sup_{f \in \mathcal{F}} |(\mathbb{P}_n - P)f|$ goes to 0 in $P_\infty^o$-probability. A class of functions $\mathcal{F}$ is said to be Donsker if
$\mathbb{G}_n f$ converges weakly in $\ell^\infty(\mathcal{F})$ to a tight measurable element. Let $d$ denote the $L_2(P)$-distance
given by $d(f, g) = \sqrt{P(f - g)^2}$. A class $\mathcal{F}$ is Donsker if and only if it is totally bounded with
respect to $d$ and, for every $\epsilon > 0$,

$$\lim_{\delta \to 0} \limsup_{n \to +\infty} P_\infty^o\Big( \sup_{d(f,g) < \delta} |\mathbb{G}_n(f - g)| > \epsilon \Big) = 0. \qquad (4)$$

The previous assertion follows from the characterization of tight sequences valued in the space
of bounded functions (van der Vaart and Wellner, 1996, Theorem 1.5.7). We refer to the book
van der Vaart and Wellner (1996) for a comprehensive study of the latter concepts.

For the sake of generality, we allow the parameter of interest $\theta_0$ to be a function of $\eta$.
Hence we further assume that $\theta_0(\cdot)$ is an element of $\ell^\infty(\mathcal{H})$.

3.1 Uniform consistency

Before being possibly expressed as a Z-estimator, the parameter of interest $\theta_0$ is often defined
as an M-estimator, i.e., $\theta_0 \in \ell^\infty(\mathcal{H})$ is such that

$$\theta_0(\eta) = \operatorname{argmin}_{\theta \in \Theta} Pm_\eta(\theta), \qquad (5)$$

where $m_\eta(\theta) : \mathcal{Z} \to \mathbb{R}$ is a known real valued measurable function, for every $\theta \in \Theta$ and each
$\eta \in \mathcal{H}$. The estimator of $\theta_0$ is denoted $\hat\theta$; it depends on $\eta$ since it satisfies

$$\hat\theta(\eta) = \operatorname{argmin}_{\theta \in \Theta} \mathbb{P}_n m_\eta(\theta). \qquad (6)$$

Both elements $\theta_0$ and $\hat\theta$ are $\mathbb{R}^p$-valued functions defined on $\mathcal{H}$. When dealing with consistency,
treating M-estimators is more general but not more difficult than Z-estimators (see Remark
3). The next theorem generalizes standard consistency theorems for M-estimators (van der Vaart, 1998,
Theorem 5.7) to uniform consistency results with respect to the objective function.

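Theorem 1. Assume that (5) and (6) hold. If

(a1) $\sup_{\theta \in \Theta,\, \eta \in \mathcal{H}} |(\mathbb{P}_n - P)m_\eta(\theta)| \to 0$ in $P_\infty^o$-probability,

(a2) $\sup_{\eta \in \mathcal{H}} |\theta(\eta) - \theta_0(\eta)| > \delta > 0 \ \Rightarrow\ \sup_{\eta \in \mathcal{H}} P\{m_\eta(\theta(\eta)) - m_\eta(\theta_0(\eta))\} > \epsilon > 0$,

then we have that $\sup_{\eta \in \mathcal{H}} |\hat\theta(\eta) - \theta_0(\eta)| \to 0$ in $P_\infty^o$-probability.

Proof. We can follow the lines of the proof of Theorem 5.7 in van der Vaart (1998). Given
$\delta > 0$, Assumption (a2) implies that there exists $\epsilon > 0$ such that

$$P_\infty^o\Big( \sup_{\eta \in \mathcal{H}} |\hat\theta(\eta) - \theta_0(\eta)| > \delta \Big) \le P_\infty^o\Big( \sup_{\eta \in \mathcal{H}} P\{m_\eta(\hat\theta(\eta)) - m_\eta(\theta_0(\eta))\} > \epsilon \Big).$$

By definition, $\mathbb{P}_n\{m_\eta(\hat\theta(\eta)) - m_\eta(\theta_0(\eta))\} \le 0$ for every $\eta \in \mathcal{H}$, so that

$$\begin{aligned}
P\{m_\eta(\hat\theta(\eta)) - m_\eta(\theta_0(\eta))\}
&= (P - \mathbb{P}_n)m_\eta(\hat\theta(\eta)) + (\mathbb{P}_n - P)m_\eta(\theta_0(\eta)) + \mathbb{P}_n\{m_\eta(\hat\theta(\eta)) - m_\eta(\theta_0(\eta))\} \\
&\le (P - \mathbb{P}_n)m_\eta(\hat\theta(\eta)) + (\mathbb{P}_n - P)m_\eta(\theta_0(\eta)) \\
&\le 2 \sup_{\theta \in \Theta,\, \eta \in \mathcal{H}} |(\mathbb{P}_n - P)m_\eta(\theta)|,
\end{aligned}$$

which goes to 0 in outer probability by (a1). □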

Remark 1. Condition (a1) requires $\{m_\eta(\theta) : \theta \in \Theta, \eta \in \mathcal{H}\}$ to be Glivenko-Cantelli. It is
enough to bound the uniform covering numbers or the bracketing numbers (van der Vaart and Wellner,
1996, Chapter 2.4). When $\Theta$ is unbounded, the Glivenko-Cantelli property may fail. Examples
include linear regression with $L_p$-losses. In such situations, one may require the optimisation set
$\Theta$ to be a compact set containing the true parameter. Another possibility is to use, if available,
special features of the function $\theta \mapsto m_\eta(\theta)$ such as convexity (Newey, 1994, Theorem 2.7).
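Remark 2. Condition (a2) is needed for the identifiability of the parameter $\theta_0$. It says that
whenever $\theta(\cdot)$ is not uniformly close to $\theta_0(\cdot)$, the objective function evaluated at $\theta(\cdot)$ is not
uniformly small. Consequently, every sequence of functions $\theta_n(\cdot)$ such that $\sup_{\eta \in \mathcal{H}} P\{m_\eta(\theta_n(\eta)) -
m_\eta(\theta_0(\eta))\} \to 0$ as $n \to +\infty$, converges uniformly to $\theta_0(\cdot)$. It is a functional version of the
so-called “well-separated maximum” (Kosorok, 2008, page 244). It is stronger but often more
convenient to verify

(a2’) $\inf_{\eta \in \mathcal{H}} \inf_{|\theta - \theta_0(\eta)| > \delta} P\{m_\eta(\theta) - m_\eta(\theta_0(\eta))\} > 0$,

for every $\delta > 0$. This resembles van der Vaart’s consistency conditions in Theorem 5.9 of
van der Vaart (1998). To show this, suppose that $\sup_{\eta \in \mathcal{H}} |\theta(\eta) - \theta_0(\eta)| > 2\delta$ and write

$$\begin{aligned}
P\{m_\eta(\theta(\eta)) - m_\eta(\theta_0(\eta))\}
&\ge 1_{\{|\theta(\eta) - \theta_0(\eta)| > \delta\}}\, P\{m_\eta(\theta(\eta)) - m_\eta(\theta_0(\eta))\} \\
&\ge 1_{\{|\theta(\eta) - \theta_0(\eta)| > \delta\}} \inf_{|\theta - \theta_0(\eta)| > \delta} P\{m_\eta(\theta) - m_\eta(\theta_0(\eta))\} \\
&\ge 1_{\{|\theta(\eta) - \theta_0(\eta)| > \delta\}} \inf_{\eta \in \mathcal{H}} \inf_{|\theta - \theta_0(\eta)| > \delta} P\{m_\eta(\theta) - m_\eta(\theta_0(\eta))\}.
\end{aligned}$$

Conclude by taking the supremum over $\eta \in \mathcal{H}$ on both sides.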

Remark 3. Estimators defined through zeros of the map $\theta \mapsto \mathbb{P}_n\psi_\eta(\theta)$ are also minimizers of
$\theta \mapsto |\mathbb{P}_n\psi_\eta(\theta)|$. Therefore they can be handled by Theorem 1. Let $\theta_0(\cdot)$ be such that $P\psi_\eta(\theta_0(\eta)) = 0$
for every $\eta \in \mathcal{H}$. If (a1) holds replacing $m$ by $\psi$ and if

$$\sup_{\eta \in \mathcal{H}} |\theta(\eta) - \theta_0(\eta)| > \delta > 0 \ \Rightarrow\ \sup_{\eta \in \mathcal{H}} |P\psi_\eta(\theta(\eta))| > \epsilon > 0,$$

then the uniform convergence of zeros of $\mathbb{P}_n\psi_\eta(\cdot)$ to the zero of $P\psi_\eta(\cdot)$ holds. Given that
$m_\eta$ is differentiable, M-estimators can be expressed as Z-estimators with $\nabla_\theta m_\eta$ as objective
function. Because a function can have several local minima, the previous condition with
$\nabla_\theta m_\eta$ is stronger than (a2). Consequently, for consistency purposes, M-estimators should not
be expressed in terms of Z-estimators, see also Newey (1994), page 2117.

3.2 Weak convergence 

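We now consider the weak convergence properties of Z-estimators indexed by the objective
functions. We assume further that $\theta_0 \in \ell^\infty(\mathcal{H})$ is such that for each $\eta \in \mathcal{H}$, $\theta_0(\eta)$ satisfies the
$p$-dimensional set of equations

$$P\psi_\eta(\theta_0(\eta)) = 0, \qquad (7)$$

where $\psi_\eta(\theta) : \mathcal{Z} \to \mathbb{R}^p$ is a known measurable function. The estimator of $\theta_0(\cdot)$ is denoted $\hat\theta(\cdot)$ and
for each $\eta$, it is assumed that

$$\mathbb{P}_n\psi_\eta(\hat\theta(\eta)) = 0. \qquad (8)$$

Here we shall suppose that $\sup_{\eta \in \mathcal{H}} |\hat\theta(\eta) - \theta_0(\eta)| = o_{P_\infty^o}(1)$, so that the function $\psi_\eta$ is not
intended to satisfy (a2). Indeed consistency may have been established from other restrictions
such as minimum argument (see Remark 3).

We require the (uniform) Fréchet differentiability of the map $\theta \mapsto P\psi_\eta(\theta)$, that is, there
exists $A_\eta : \Theta \to \mathbb{R}^{p \times p}$ such that

$$P\psi_\eta(\theta) - P\psi_\eta(\tilde\theta) - A_\eta(\tilde\theta)(\theta - \tilde\theta) = o(\theta - \tilde\theta), \qquad (9)$$

where the $o(\theta - \tilde\theta)$ does not depend on $\eta$.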
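Theorem 2. Assume that (7) and (8) hold. If

(a3) $\sup_{\eta \in \mathcal{H}} |\hat\theta(\eta) - \theta_0(\eta)| \to 0$ in $P_\infty^o$-probability,

(a4) for every $\epsilon > 0$, $\exists \delta > 0$ such that $|\theta - \tilde\theta| < \delta$ implies that $\sup_{\eta \in \mathcal{H}} d(\psi_\eta(\theta), \psi_\eta(\tilde\theta)) < \epsilon$,

(a5) there exists $\delta > 0$ such that the class $\Psi = \{z \mapsto \psi_\eta(\theta)(z) : |\theta - \theta_0(\eta)| < \delta,\ \eta \in \mathcal{H}\}$ is
P-Donsker,

(a6) the matrix $B_\eta := A_\eta(\theta_0(\eta))$ defined in (9), is bounded and invertible, uniformly in $\eta$,

then we have that

$$n^{1/2}(\hat\theta(\eta) - \theta_0(\eta)) = -B_\eta^{-1}\mathbb{G}_n\psi_\eta(\theta_0(\eta)) + o_{P_\infty^o}(1).$$

Consequently, $\sqrt{n}(\hat\theta(\eta) - \theta_0(\eta))$ converges weakly to a tight zero-mean Gaussian element in
$\ell^\infty(\mathcal{H})$ whose covariance function is given by
$(\eta_1, \eta_2) \mapsto B_{\eta_1}^{-1} P\{\psi_{\eta_1}(\theta_0(\eta_1))\psi_{\eta_2}(\theta_0(\eta_2))^T\}(B_{\eta_2}^{-1})^T$.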

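Proof. We follow a standard approach by first deriving the (uniform) rates of convergence and
second computing the asymptotic distribution (van der Vaart, 1998, Theorem 5.21). Thanks to
(a3) and (a4), we know that there exists a nonrandom positive sequence $\delta_n \to 0$, such that the
sets

$$\Big\{\sup_{\eta \in \mathcal{H}} |\hat\theta(\eta) - \theta_0(\eta)| < \delta_n\Big\} \quad \text{and} \quad \Big\{\sup_{\eta \in \mathcal{H}} d(\psi_\eta(\hat\theta(\eta)), \psi_\eta(\theta_0(\eta))) < \delta_n\Big\}$$

have probability going to 1. Without loss of generality, we assume in the proof that these events
are realized. By definition of $\hat\theta(\eta)$, we have

$$\begin{aligned}
o_{P_\infty^o}(1) &= n^{1/2}\{\mathbb{P}_n\psi_\eta(\hat\theta(\eta)) - P\psi_\eta(\theta_0(\eta))\} \\
&= \mathbb{G}_n\{\psi_\eta(\hat\theta(\eta)) - \psi_\eta(\theta_0(\eta))\} + \mathbb{G}_n\psi_\eta(\theta_0(\eta)) + n^{1/2}P\{\psi_\eta(\hat\theta(\eta)) - \psi_\eta(\theta_0(\eta))\}. \qquad (10)
\end{aligned}$$

The first term is treated as follows. For every $\eta \in \mathcal{H}$, we have that $(\psi_\eta(\hat\theta(\eta)), \psi_\eta(\theta_0(\eta))) \in \mathcal{D}(\delta_n)$
with

$$\mathcal{D}(\delta) = \{(\psi, \tilde\psi) \in \Psi \times \Psi : d(\psi, \tilde\psi) < \delta\}.$$

It follows that

$$|\mathbb{G}_n(\psi_\eta(\hat\theta(\eta)) - \psi_\eta(\theta_0(\eta)))| \le \sup_{(\psi, \tilde\psi) \in \mathcal{D}(\delta_n)} |\mathbb{G}_n(\psi - \tilde\psi)|.$$

Now using (a5) and equation (4), the first term of (10) goes to 0 in $P_\infty^o$-probability. As a
consequence, we know that $\mathbb{G}_n\psi_\eta(\theta_0(\eta)) + \sqrt{n}P\{\psi_\eta(\hat\theta(\eta)) - \psi_\eta(\theta_0(\eta))\} = o_{P_\infty^o}(1)$ or, equivalently,
that

$$\mathbb{G}_n\psi_\eta(\theta_0(\eta)) + B_\eta n^{1/2}(\hat\theta(\eta) - \theta_0(\eta)) = a_n(\eta) + o_{P_\infty^o}(1), \qquad (11)$$

with $a_n(\eta) = -\sqrt{n}\{P\{\psi_\eta(\hat\theta(\eta)) - \psi_\eta(\theta_0(\eta))\} - B_\eta(\hat\theta(\eta) - \theta_0(\eta))\}$. Using (a6), we have that

$$\begin{aligned}
|a_n(\eta)| &\le |n^{1/2}(\hat\theta(\eta) - \theta_0(\eta))| \sup_{|\theta - \tilde\theta| < \delta_n} \frac{|P\psi_\eta(\theta) - P\psi_\eta(\tilde\theta) - A_\eta(\tilde\theta)(\theta - \tilde\theta)|}{|\theta - \tilde\theta|} \\
&\le o(1) \sup_{\eta \in \mathcal{H}} |n^{1/2}(\hat\theta(\eta) - \theta_0(\eta))|. \qquad (12)
\end{aligned}$$

Then, by (a5), $\sup_{\eta \in \mathcal{H}} |\mathbb{G}_n\psi_\eta(\theta_0(\eta))| = O_{P_\infty^o}(1)$; using (a6) again, in particular the full rank
condition on $B_\eta$, we get

$$|n^{1/2}(\hat\theta(\eta) - \theta_0(\eta))| \le |B_\eta n^{1/2}(\hat\theta(\eta) - \theta_0(\eta))| \sup_{|u| = 1} |B_\eta^{-1}u| = O_{P_\infty^o}(1).$$

Bringing the previous information into equation (12) gives that $a_n(\eta) = o_{P_\infty^o}(1)$, therefore by
equation (11), we get

$$\mathbb{G}_n\psi_\eta(\theta_0(\eta)) + B_\eta n^{1/2}(\hat\theta(\eta) - \theta_0(\eta)) = o_{P_\infty^o}(1),$$

and the conclusion follows. □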

Remark 4. Weak convergence of M-estimators is more difficult to obtain than weak convergence
of Z-estimators. An interesting strategy is to focus on convex objective functions as
developed in Pollard (1985). Unlike Theorem 2, this approach might include non-smooth objective
functions, e.g., least-absolute deviation. More recently, Kato (2009) considers convex
objective functions that are indexed by real parameters. The main application deals with the
weak convergence of the quantile regression process.

Remark 5. In all the examples given in Section 2, $\theta_0$ remains fixed as $\eta$ varies. Then $\psi_\eta$,
$\eta \in \mathcal{H}$, represents a range of criterion functions available for estimating $\theta_0$. In this context,
condition (a4) becomes

(a4’) whenever $\theta \to \theta_0$, $\sup_{\eta \in \mathcal{H}} d(\psi_\eta(\theta), \psi_\eta(\theta_0)) \to 0$,

and condition (a6) is reduced to

(a6’) there exists $B_\eta \in \mathbb{R}^{p \times p}$ invertible and bounded uniformly in $\eta$ such that

$$P\psi_\eta(\theta) - P\psi_\eta(\theta_0) - B_\eta(\theta - \theta_0) = o(\theta - \theta_0),$$

where the $o(\theta - \theta_0)$ does not depend on $\eta$.

3.3 Efficiency 

In this section, Theorem 2 is used to establish conditions for the efficiency of $\hat\theta(\hat\eta)$ estimating
$\theta_0 \in \Theta$. Hence we shall assume that for every $\eta \in \mathcal{H}$, $\theta_0(\eta) = \theta_0$ (as in the introduction and in
Remark 5). Given $\hat\eta$, a consistent estimator of $\eta_0$, the next theorem asserts that, whatever the
accuracy of $\hat\eta$, $\hat\theta(\eta_0)$ and $\hat\theta(\hat\eta)$ have the same asymptotic variance.

Theorem 3. Assume that (7), (8), (a3), (a4’), (a5) and (a6’) hold. If

(a7) for every $\eta \in \mathcal{H}$, $\theta_0(\eta) = \theta_0$,

(a8) there exist $\hat\eta$ and $\eta_0 \in \mathcal{H}$ such that $P_\infty^o(\hat\eta \in \mathcal{H}) \to 1$ and $\|\hat\eta - \eta_0\| \to 0$ in $P_\infty^o$-probability,

(a9) the quantities $d(\psi_\eta(\theta_0), \psi_{\eta_0}(\theta_0))$ and $B_\eta - B_{\eta_0}$ go to 0, as soon as $\|\eta - \eta_0\| \to 0$,

then $\hat\theta(\hat\eta)$ has the same asymptotic law as $\hat\theta(\eta_0)$.
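Proof. Write

$$\hat\theta(\hat\eta) - \theta_0 = (\hat\theta(\hat\eta) - \hat\theta(\eta_0)) + (\hat\theta(\eta_0) - \theta_0),$$

we only have to show that the first term is negligible, i.e., $\hat\theta(\hat\eta) - \hat\theta(\eta_0) = o_{P_\infty^o}(n^{-1/2})$. By (a8)
and (a9), we can carry out the proof assuming that the event

$$\hat\eta \in \mathcal{H}, \quad d(\psi_{\hat\eta}(\theta_0), \psi_{\eta_0}(\theta_0)) \le \delta_n, \quad |B_{\hat\eta} - B_{\eta_0}| \le \delta_n$$

is realized, for a certain nonrandom positive sequence $\delta_n \to 0$. Then applying Theorem 2, we
find

$$n^{1/2}(\hat\theta(\hat\eta) - \hat\theta(\eta_0)) = B_{\hat\eta}^{-1}\mathbb{G}_n\{\psi_{\eta_0}(\theta_0) - \psi_{\hat\eta}(\theta_0)\} + (B_{\eta_0}^{-1} - B_{\hat\eta}^{-1})\mathbb{G}_n\psi_{\eta_0}(\theta_0) + o_{P_\infty^o}(1).$$

To obtain the convergence in probability to 0 of the first term, we use (a5) to rely on the
stochastic equicontinuity, as in (4). By (a6’), the second term equals $B_{\eta_0}^{-1} - B_{\hat\eta}^{-1} = B_{\eta_0}^{-1}(B_{\hat\eta} -
B_{\eta_0})B_{\hat\eta}^{-1} = O(\delta_n)$ times a term that is bounded in probability.

□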

3.4 Conditional moment restrictions 

We now consider conditional moment restrictions models given by

$$E(\varphi(Z, \beta_0) \mid X) = 0, \qquad (13)$$

where $X \in \mathcal{X}$ and $Z \in \mathcal{Z}$ are random variables with joint law $P$ and $\varphi$ is a known $\mathbb{R}^p$-valued
function. The conditional restriction (13) implies that infinitely many (unconditional) equations
are available to characterize $\beta_0$, that is, for every bounded measurable function $W$ defined on
$\mathcal{X}$, one has

$$E(W(X)\varphi(Z, \beta_0)) = 0.$$
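The step from the conditional restriction (13) to the unconditional equations is the tower property of conditional expectation, spelled out here for a bounded measurable $W$ (boundedness guarantees integrability of the product):

```latex
\begin{align*}
E\bigl(W(X)\,\varphi(Z,\beta_0)\bigr)
  &= E\bigl(E(W(X)\,\varphi(Z,\beta_0)\mid X)\bigr) \\
  &= E\bigl(W(X)\,E(\varphi(Z,\beta_0)\mid X)\bigr) \\
  &= 0 .
\end{align*}
```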

Assume that the sequence $(Z_i, X_i)_{1 \le i \le n}$ is independent and identically distributed from model
(13). The estimator $\hat\beta(W)$ satisfies

$$\sum_{i=1}^n W(X_i)\varphi(Z_i, \hat\beta(W)) = 0, \qquad (14)$$

for every $W$ lying over the class of bounded functions denoted $\mathcal{W}$. Note that it includes Example
2 of Section 2, for which $\mathcal{W}$ is a real valued class of functions, and Example 4, for which $\mathcal{W}$ is a
$\mathbb{R}^{p \times p}$-valued class of functions. The following statement sheds light on special features involved
by the particular objective function $(\beta, W) \mapsto W(\cdot)\varphi(\cdot, \beta)$. In the following, we implicitly
assume enough measurability on the estimators $\hat W$ in order to use Fubini’s theorem freely. For
this reason, outer expectations are no longer necessary. Define $e(z) = \sup_{\beta \in B_0} |\varphi(z, \beta)|$.

Theorem 4. Assume that (13) and (14) hold. If

(b1) $\sup_{W \in \mathcal{W}} |\hat\beta(W) - \beta_0| \to 0$ in $P_\infty^o$-probability.

(b2) Whenever $\beta \to \beta_0$, $E(\varphi(Z, \beta) - \varphi(Z, \beta_0))^2 \to 0$.

(b3) Let $\delta > 0$ and let $B_0$ denote an open ball centred at $\beta_0$. The class $\{z \mapsto \varphi(z, \beta) : \beta \in B_0\}$
is P-Donsker and $Pe^{2+\delta} < +\infty$. Moreover, the class $e\mathcal{W}$ is P-Donsker and $\mathcal{W}$ is uniformly
bounded by $W_\infty$.

(b4) There exists $B : \mathcal{X} \to \mathbb{R}^{p \times p}$ such that $E|B(X)| < +\infty$ and, for every $x \in \mathcal{X}$,

$$|E(\varphi(Z, \beta) - \varphi(Z, \beta_0) \mid X = x) - B(x)(\beta - \beta_0)| \le \kappa|\beta - \beta_0|^2$$

for some $\kappa > 0$ and $EW(X)B(X)$ is bounded and invertible, uniformly in $W$.

(b5) There exist $\hat W : \mathcal{X} \to \mathbb{R}$ (suitably measurable) and $W_0 : \mathcal{X} \to \mathbb{R}$ such that (i) $P_\infty(\hat W \in
\mathcal{W}) \to 1$, (ii) $|\hat W(x) - W_0(x)| \to 0$ in $P_\infty$-probability, $P(dx)$-almost everywhere,

then $\hat\beta(\hat W)$ has the same asymptotic law as $\hat\beta(W_0)$.

Proof. We verify each condition of Theorem 3. Clearly (b2) implies (a4’), which is enough to
get (a4) (see Remark 5). We now show that (b3) $\Rightarrow$ (a5). Since tightness of random vectors
is equivalent to tightness of each coordinate, we can focus on each coordinate separately. In
what follows, without loss of generality, we assume that $\varphi(z, \beta)$ and $W(x)$ are real numbers,
for each $\beta \in B_0$, $z \in \mathcal{Z}$, $x \in \mathcal{X}$. Because the class of interest is the product of two classes,
$\{z \mapsto \varphi(z, \beta) : \beta \in B_0\}$ and $\mathcal{W}$, we can apply Corollary 2.10.13 in van der Vaart and Wellner
(1996). Given two pairs $(\beta, \tilde\beta)$ and $(W, \tilde W)$, we check that

$$(W(x)\varphi(z, \beta) - \tilde W(x)\varphi(z, \tilde\beta))^2 \le 2W_\infty^2(\varphi(z, \beta) - \varphi(z, \tilde\beta))^2 + 2e(z)^2(W(x) - \tilde W(x))^2$$

and we can easily verify every condition of the corollary. Note that the moments of order $2 + \delta$ of
$e$ have not been used yet. We now show that (b4) $\Rightarrow$ (a6’) by setting $B_W$ equal to $EW(X)B(X)$,
which is indeed invertible and bounded. We have

$$|E\{W(X)(\varphi(Z, \beta) - \varphi(Z, \beta_0) - B(X)(\beta - \beta_0))\}| \le \kappa W_\infty|\beta - \beta_0|^2,$$

which implies (a6’). Note that (13) implies (a7) and (b5) implies (a8), with respect to the
metric of pointwise convergence (in probability). It remains to show that (a9) holds with $B_W$
equal to $EW(X)B(X)$. Given $\epsilon > 0$, write

$$\begin{aligned}
\Big|\int (\hat W(x) - W_0(x))B(x)P(dx)\Big| &\le \int |B(x)|\,|\hat W(x) - W_0(x)|P(dx) \\
&\le \epsilon \int |B(x)|P(dx) + 2W_\infty \int |B(x)|\,1_{\{|\hat W(x) - W_0(x)| > \epsilon\}}P(dx).
\end{aligned}$$

Taking the expectation, Fubini’s theorem leads to

$$E\Big|\int (\hat W(x) - W_0(x))B(x)P(dx)\Big| \le \epsilon \int |B(x)|P(dx) + 2W_\infty \int |B(x)|\,P_\infty(|\hat W(x) - W_0(x)| > \epsilon)P(dx),$$

and the second term on the right-hand side goes to 0 by the Lebesgue dominated convergence
theorem. Conclude by choosing $\epsilon$ small. Using that $Ee^{2+\delta} < +\infty$, the same analysis is conducted to show that
$E\{\varphi(Z, \beta_0)^2(\hat W(X) - W_0(X))^2\}$ goes to 0 in $P_\infty$-probability.

□

Remark 6. Condition (b4) deals with the regularity of the map $\beta \mapsto E(\varphi(Z, \beta) \mid X = x)$.
This is in general weaker than asking for the regularity of the map $\beta \mapsto \varphi(z, \beta)$. For instance it
makes it possible to include the Huber loss function (defined in Example 3) as an example. Note also that,
contrary to $\mathcal{W}$, the class of functions $\{z \mapsto \varphi(z, \beta) : \beta \in B_0\}$ is not supposed to be bounded. This
flexibility is important in order to consider examples such as weighted least-squares.
Remark 7. Under the conditions of Theorem 4, the sequence $\sqrt{n}(\hat\beta(W) - \beta_0)$ converges weakly
in $\ell^\infty(\mathcal{W})$ to a tight zero-mean Gaussian element whose covariance function is given by

$$(W_1, W_2) \mapsto C_{W_1}^{-1} E(W_1(X)\varphi(Z, \beta_0)\varphi(Z, \beta_0)^T W_2(X)) (C_{W_2}^{-1})^T,$$

with $C_W = E(W(X)B(X))$.
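The covariance function of Remark 7 can be estimated by replacing expectations with sample averages. The sketch below does so for the least-squares score $\varphi(Z, \beta) = (Y - \beta^T \tilde X)\tilde X$ with $\tilde X = (1, X)$, for which $B(x) = -\tilde x \tilde x^T$ satisfies (b4) exactly; this choice of score, the function name and the simulated data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def sandwich_covariance(design, residuals, w1, w2):
    """Plug-in estimate of C_{W1}^{-1} E[W1 W2 e^2 X X^T] (C_{W2}^{-1})^T.

    design    : (n, p) matrix whose rows are the regressors (including the intercept)
    residuals : (n,) residuals Y_i - beta^T X_i at the estimated beta
    w1, w2    : (n,) weights W1(X_i) and W2(X_i)
    """
    n = design.shape[0]
    c1 = -(design.T * w1) @ design / n          # estimate of C_{W1} = E[W1(X) B(X)]
    c2 = -(design.T * w2) @ design / n          # estimate of C_{W2}
    middle = (design.T * (w1 * w2 * residuals ** 2)) @ design / n
    return np.linalg.solve(c1, middle) @ np.linalg.inv(c2).T

rng = np.random.default_rng(3)
n = 20_000
x = rng.standard_normal(n)
design = np.column_stack([np.ones(n), x])
noise = rng.standard_normal(n)                  # homoscedastic unit-variance noise (assumption)
ones = np.ones(n)
cov = sandwich_covariance(design, noise, ones, ones)
```

With unit weights, standard normal covariates and unit-variance noise, `cov` should be close to the identity matrix, which is the usual asymptotic covariance of ordinary least squares in that design.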

Remark 8. In the statement of Theorem 3, the metric $\|\cdot\|$ on the space $\mathcal{H}$ needs to be not too
strong in order to prove (a8) and not too weak to guarantee (a9). In the context of conditional
moment restriction, it is required that $E\{e(Z)^2(W(X) - W_0(X))^2\} \to 0$ whenever $\|W - W_0\| \to 0$.
If $\hat W$ is a Nadaraya-Watson estimator, the uniform metric is too strong to prove (a8) because
such an estimator can be inconsistent at the boundary of the domain. By Hölder’s
inequality, one has

$$E\{e(Z)^2(W(X) - W_0(X))^2\} \le \Big\{E e(Z)^{\frac{2(r+2)}{r}}\Big\}^{\frac{r}{r+2}} \Big\{E|W(X) - W_0(X)|^{2+r}\Big\}^{\frac{2}{2+r}},$$

for some $r > 0$, making it reasonable to work with the $L_{2+r}(P)$-metric. However, Lebesgue’s
dominated convergence theorem makes it possible to use the topology of pointwise convergence (in
probability). This results in weaker conditions on the moments of $e(Z)$.

Finally we rely on the bracketing numbers of the associated classes. Even if the following 
result is weaker than Theorem 4, the bracketing approach offers tractable conditions in practice. 
Moreover it allows for function classes that depend on n. 

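Theorem 5. Assume that (13), (14), (b1), (b2) and (b4) hold. If

(b3’) denote by $\Phi = \{z \mapsto \varphi(z, \beta) : \beta \in B_0\}$, then

(i) $\int_0^{+\infty} \sqrt{\log \mathcal{N}_{[\,]}(\epsilon, \Phi, L_2(P))}\, d\epsilon < +\infty$,

(ii) $\int_0^{\delta_n} \sqrt{\log \mathcal{N}_{[\,]}(\epsilon, \mathcal{W}_n, L_r(P))}\, d\epsilon \to 0$,

for every sequence $\delta_n \to 0$, and $Ee(Z)^{2+\delta} < +\infty$, for some $\delta > 0$ and $r = 2(2+\delta)/\delta$,

(b5’) there exist $\hat W : \mathcal{X} \to \mathbb{R}$ (suitably measurable) and $W_0 : \mathcal{X} \to \mathbb{R}$ such that (i) $P_\infty(\hat W \in
\mathcal{W}_n) \to 1$, (ii) $|\hat W(x) - W_0(x)| \to 0$ in $P_\infty$-probability, $P(dx)$-almost everywhere,

in place of (b3) and (b5), respectively, then $\hat\beta(\hat W)$ has the same asymptotic law as $\hat\beta(W_0)$.

Proof. The proof follows from Theorems 2 and 3 with the following change. To obtain that

$$\mathbb{G}_n\{\varphi(\cdot, \hat\beta)\hat W - \varphi(\cdot, \beta_0)W_0\} = o_P(1), \qquad (15)$$

we can no longer rely on the Donsker property since the class $\mathcal{W}_n$ depends on $n$. Instead we
use Theorem 2.2 in van der Vaart and Wellner (2007), which asserts that (15) holds whenever

$$E\{\varphi(Z, \hat\beta)\hat W(X) - \varphi(Z, \beta_0)W_0(X)\}^2 \to 0 \ \text{in probability},$$

$$\int_0^{\delta_n} \sqrt{\log \mathcal{N}_{[\,]}(\epsilon, \Phi\mathcal{W}_n, L_2(P))}\, d\epsilon \to 0,$$

$$Pe^2 = O(1), \qquad Pe^2 1_{\{e > \epsilon\sqrt{n}\}} \to 0,$$

for every sequence $\delta_n \to 0$. Hence the proof follows from verifying the three previous conditions.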
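Using the Lebesgue dominated convergence theorem and Fubini’s theorem (as was done
in the proof of Theorem 4 to obtain (a9)), the first condition is a consequence of (b2) and
the pointwise convergence in probability of $\hat W$. The second condition is a consequence of the
bounds on the bracketing entropy of $\Phi$ and $\mathcal{W}_n$ provided in (b3’). Given $\epsilon > 0$, let $[\underline\varphi_j, \overline\varphi_j]$,
$j = 1, \ldots, n_1$, be brackets of $L_2(P)$-size $\epsilon$ that cover $\Phi$ and let $[\underline W_k, \overline W_k]$, $k = 1, \ldots, n_2$, be
brackets of $L_r(P)$-size $\epsilon$ that cover $\mathcal{W}_n$. Because the function $(x, y) \mapsto xy$ attains its bounds on
every rectangle at the corners of the rectangle, the brackets

$$[\min(g_{jk}), \max(g_{jk})], \quad j = 1, \ldots, n_1, \quad k = 1, \ldots, n_2,$$

with $g_{jk} = (\underline\varphi_j \underline W_k, \underline\varphi_j \overline W_k, \overline\varphi_j \underline W_k, \overline\varphi_j \overline W_k)$, cover the class $\Phi\mathcal{W}_n$. Moreover, we have

$$\begin{aligned}
|\max(g_{jk}) - \min(g_{jk})| &\le |\overline\varphi_j \overline W_k - \underline\varphi_j \underline W_k| \\
&\le W_\infty|\overline\varphi_j - \underline\varphi_j| + |e|\,|\overline W_k - \underline W_k|,
\end{aligned}$$

then, using Minkowski’s and Hölder’s inequalities, we get

$$\|\max(g_{jk}) - \min(g_{jk})\|_2 \le \epsilon(W_\infty + \|e\|_{2+\delta}).$$

Hence we have shown that, for every $\epsilon > 0$, $\mathcal{N}_{[\,]}(\epsilon(W_\infty + \|e\|_{2+\delta}), \Phi\mathcal{W}_n, L_2(P))$ is smaller than
$\mathcal{N}_{[\,]}(\epsilon, \Phi, L_2(P))$ times $\mathcal{N}_{[\,]}(\epsilon, \mathcal{W}_n, L_r(P))$. This implies the integrability condition. The third
condition is simply obtained by using the $(2 + \delta)$-moments of the function $e$. □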
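The Minkowski and Hölder step used for the bracket sizes can be written out as follows. This is a sketch under the assumption that $r$ is linked to $\delta$ by $1/(2+\delta) + 1/r = 1/2$, which is the relation needed for the Hölder step to go through; $[\underline\varphi_j, \overline\varphi_j]$ denotes a bracket of $L_2(P)$-size $\epsilon$ and $[\underline W_k, \overline W_k]$ a bracket of $L_r(P)$-size $\epsilon$.

```latex
\begin{align*}
\bigl\| W_\infty |\overline\varphi_j - \underline\varphi_j|
      + |e|\,|\overline W_k - \underline W_k| \bigr\|_2
  &\le W_\infty \|\overline\varphi_j - \underline\varphi_j\|_2
      + \bigl\| e\,(\overline W_k - \underline W_k) \bigr\|_2
      && \text{(Minkowski)} \\
  &\le W_\infty \|\overline\varphi_j - \underline\varphi_j\|_2
      + \|e\|_{2+\delta}\, \|\overline W_k - \underline W_k\|_r
      && \text{(H\"older, } \tfrac{1}{2+\delta} + \tfrac{1}{r} = \tfrac12\text{)} \\
  &\le \epsilon\,\bigl(W_\infty + \|e\|_{2+\delta}\bigr).
\end{align*}
```

Solving the exponent relation gives $r = 2(2+\delta)/\delta$, which is larger than 2 for every $\delta > 0$.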

4 Application to weighted regression 

In this section, we are interested in the estimation of $\beta_0 = (\beta_{01}, \beta_{02}) \in \mathbb{R}^{1+q}$ defined by the following model

$$E(Y|X) = \beta_{01} + \beta_{02}^T X, \tag{16}$$

where the conditional distribution of $Y - \beta_{01} - \beta_{02}^T X$ given $X \in \mathbb{R}^q$ is symmetric about 0. For the sake of clarity, we focus on a linear model, but under classical regularity conditions one can easily include more general link functions in our framework. We consider heteroscedasticity, i.e., the residual $Y - \beta_{01} - \beta_{02}^T X$ is not independent of the covariates $X$. In this context, the classical least-squares estimator is not efficient and one should use weighted least-squares to improve the estimation. Assume that $(Y,X), (Y_1,X_1), \ldots, (Y_n,X_n)$ are independent and identically distributed random variables from the model (16). A general class of estimators is given by $\hat\beta(w)$, defined by

$$\hat\beta(w) = \operatorname{argmin}_{(\beta_1,\beta_2) \in \mathbb{R}^{1+q}} \sum_{i=1}^n \rho(|Y_i - \beta_1 - \beta_2^T X_i|)\, w(X_i),$$

where $\rho : \mathbb{R}^+ \to \mathbb{R}^+$ is convex, positive and differentiable and $w : \mathbb{R}^q \to \mathbb{R}$ is called the weight function. Such a class of estimators is studied in Huber (1967), where special attention is paid to the robustness properties associated with the choice of $\rho$. Note that when $\rho(x) = x^2$ we obtain the classical linear least-squares estimator, when $\rho(x) = x$ we get median regression, and $\rho(x) = (x^2/2) 1_{\{0 \le x \le c\}} + c(x - c/2) 1_{\{x > c\}}$ corresponds to Huber robust regression (where $c$ needs to be chosen in a proper way). Finally, quantile regression estimators and $L^p$-loss estimators are included in this class as well.
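
To make the class of estimators concrete, the following sketch writes the three loss functions named above and the weighted objective whose minimiser is $\hat\beta(w)$. It is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the default cut-off `c=1.345` is only a commonly used conventional value, assumed here for illustration.

```python
import numpy as np

def rho_least_squares(x):
    """rho(x) = x^2, evaluated at the absolute residual x >= 0."""
    return np.asarray(x, dtype=float) ** 2

def rho_absolute(x):
    """rho(x) = x, which gives median regression."""
    return np.asarray(x, dtype=float)

def rho_huber(x, c=1.345):
    """Huber loss: x^2/2 for 0 <= x <= c and c(x - c/2) for x > c (c is an assumed default)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= c, 0.5 * x ** 2, c * (x - 0.5 * c))

def weighted_objective(beta, X, Y, rho, w):
    """sum_i rho(|Y_i - beta_1 - beta_2^T X_i|) * w(X_i), with beta = (beta_1, beta_2)."""
    beta = np.asarray(beta, dtype=float)
    abs_resid = np.abs(Y - beta[0] - X @ beta[1:])
    return float(np.sum(rho(abs_resid) * w(X)))

# Tiny noiseless example: the objective vanishes at the true parameter.
X_demo = np.array([[1.0], [2.0], [3.0]])
Y_demo = 1.0 + 2.0 * X_demo[:, 0]
unit_weight = lambda X: np.ones(len(X))
value_at_truth = weighted_objective([1.0, 2.0], X_demo, Y_demo, rho_huber, unit_weight)
```

Any generic convex optimiser can then be applied to `weighted_objective`; for $\rho(x) = x^2$ the minimiser has the usual closed weighted least-squares form.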
We consider three different approaches to estimate the optimal weight function $w_0$. Each approach leads to different rates of convergence. The first one is parametric, i.e., we assume that $w_0$ belongs to a class depending on a Euclidean parameter. The second one is nonparametric, i.e., we do not assume anything on $w_0$ except some regularity conditions. The third one is semiparametric and reflects a compromise between both previous approaches.

It is an exercise to verify each condition of Theorem 5. Here we focus on the special conditions dealing with the estimator $\hat w$ of $w_0$, namely Conditions (b3') (ii) and (b5'). The other conditions are classical and have been examined for different examples (Newey and McFadden, 1994).


4.1 Efficient weights 

A first question that arises is whether or not such a class of estimators possesses an optimal member. The answer is provided in Bates and White (1993), where the existence of a minimal variance estimator is discussed. Basically, optimal members must satisfy the equation: "the variance of the score equals the Jacobian of the expected score" (as maximum likelihood estimators do). The optimal weight function is then given by

$$w_0(x) = \frac{2\rho'(0) f_{e_1|X_1=x}(0) + E(g_{2,\beta_0}(Y_1,X_1)|X_1 = x)}{E(g_{1,\beta_0}(Y_1,X_1)|X_1 = x)},$$

where $g_{1,\beta}(y,x) = \rho'(|y - \beta_1 - \beta_2^T x|)^2$, $g_{2,\beta}(y,x) = \rho''(|y - \beta_1 - \beta_2^T x|)$, $e_1 = Y_1 - \beta_{01} - \beta_{02}^T X_1$ and $f_{e_1|X_1}$ stands for the conditional density of $e_1$ given $X_1$. In the following, we restrict our attention to the case $\rho'(0) = 0$, so that $w_0$ simplifies to

$$w_0(x) = \frac{N_{\beta_0}(x)}{D_{\beta_0}(x)},$$

with $N_\beta(x) = E(g_{2,\beta}(Y_1,X_1)|X_1 = x) f(x)$ and $D_\beta(x) = E(g_{1,\beta}(Y_1,X_1)|X_1 = x) f(x)$, $f$ being the density of $X_1$. Concerning the examples cited above, this restriction only rules out quantile regression estimators.
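
As a quick check of the simplified formula, consider the least-squares case $\rho(x) = x^2$, for which $\rho'(0) = 0$. The short derivation below is a worked illustration only; it uses nothing beyond the definitions of $g_{1,\beta}$, $g_{2,\beta}$, $N_\beta$ and $D_\beta$ given in the text.

```latex
\begin{align*}
g_{1,\beta_0}(Y_1,X_1) &= \rho'(|e_1|)^2 = (2|e_1|)^2 = 4 e_1^2, \\
g_{2,\beta_0}(Y_1,X_1) &= \rho''(|e_1|) = 2, \\
w_0(x) &= \frac{N_{\beta_0}(x)}{D_{\beta_0}(x)}
        = \frac{2 f(x)}{4\, E(e_1^2 \mid X_1 = x)\, f(x)}
        = \frac{1}{2\, E(e_1^2 \mid X_1 = x)}.
\end{align*}
```

Since multiplying the weight function by a positive constant does not change the minimiser, the optimal weight is, up to the factor $1/2$, the inverse of the conditional variance of the residual, which is the familiar weighted least-squares weight.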

A first estimator that needs to be computed is $\hat\beta^{(0)}$, defined as $\hat\beta(w)$ with constant weight function, $w(x) = 1$ for every $x \in \mathbb{R}^q$. Even if $\hat\beta^{(0)}$ is not efficient, it is well known that it is consistent for the estimation of $\beta_0$. Since $w_0$ depends on $\beta_0$, we use $\hat\beta^{(0)}$ as a first-step estimator to carry out the estimation of $w_0$.


4.2 Parametric estimation of $w_0$

In this paragraph, we assume that $w_0(x) = w(x, \gamma_0)$, where $\gamma_0$ belongs to a Euclidean space. Typically $\gamma_0$ is a vector that contains $\beta_0$. Such a situation has been extensively studied (see for instance Carroll et al. (1988) and the references therein), and it has been shown under quite general conditions that $\gamma_0$ can be estimated consistently. As a consequence we assume in the next lines that there exists $\hat\gamma$ such that $\hat\gamma \to \gamma_0$ in $P_\infty$-probability. The estimator of $w_0(x)$ is then given by $w(x, \hat\gamma)$, for every $x \in \mathbb{R}^q$. As a consequence, $\beta_0$ is estimated by

$$\operatorname{argmin}_{(\beta_1,\beta_2) \in \mathbb{R}^{1+q}} \sum_{i=1}^n \rho(|Y_i - \beta_1 - \beta_2^T X_i|)\, w(X_i, \hat\gamma).$$

To verify (b3') (ii) and (b5'), it is enough to ask the Lipschitz condition

$$|w(x,\gamma) - w(x,\gamma')| \le |\gamma - \gamma'|. \tag{17}$$

On the one hand, (b5') holds trivially with $\mathcal{W}_n$ equal to the class $\{x \mapsto w(x,\gamma) : |\gamma - \gamma_0| < \delta\}$, for any $\delta > 0$. On the other hand, (b3') (ii) is satisfied because the latter class has the same size as the Euclidean ball of size $\delta$. Condition (17) is sufficient but not necessary.
Other interesting examples include for instance w{x,/3,^) = (1 + , reminiscent of a 

piecewise heteroscedastic model. 

In the parametric regression context given by (16), the parametric modelling of $w_0$ has serious drawbacks. Since a parametric form is already assumed for the conditional mean, it is very restrictive to parametrize the optimal weights in addition. Moreover, the definition of $w_0$ as a quotient of conditional expectations makes it difficult to set any plausible parametric family. Finally, Theorem 5 does not require any rate of convergence for the estimation of $w_0$. Hence a parametric approach is unnecessary.


4.3 Nonparametric estimation of $w_0$

We consider the bracketing entropy generated by nonparametric estimators. The classical approach taken for local polynomial estimators relies on the asymptotic smoothness of such estimators (Ojeda, 2008). In the Nadaraya-Watson case, this smoothness approach cannot succeed since, whenever the support of the targeted function is compact, the bias tends to a function that jumps at the boundary of the support. In the following, the Nadaraya-Watson case is studied by splitting the bias and the variance. To compute the bracketing entropy generated by these estimators, we treat similarly and independently the numerator and the denominator. For the numerator $\hat N$, we write it as $E\hat N + \Delta_N$. Since the class drawn by $E\hat N$ is not random, it certainly results in a smaller entropy than the function class generated by $\Delta_N$, which turns out to be included, as $n$ increases and under reasonable conditions, in a smooth class of functions.

For any differentiable function $f : Q \subset \mathbb{R}^q \to \mathbb{R}$ and any $l = (l_1,\ldots,l_q) \in \mathbb{N}^q$, let $f^{(l)}$ abbreviate $\frac{\partial^{|l|}}{\partial x_1^{l_1} \cdots \partial x_q^{l_q}} f$, where $|l| = \sum_{i=1}^q l_i$. For $k \in \mathbb{N}$, $0 < \alpha \le 1$, $M > 0$, we say that $f \in C_{k+\alpha,M}(Q)$ if, for every $|l| \le k$, $f^{(l)}$ exists and is bounded by $M$ on $Q$ and, for every $|l| = k$ and every $(x,x') \in Q^2$, we have

$$|f^{(l)}(x) - f^{(l)}(x')| \le M |x - x'|^{\alpha}.$$


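We define the estimator by

$$\hat w(x) = \frac{\hat N(x)}{\hat D(x)},$$

where

$$\hat N(x) = n^{-1} \sum_{i=1}^n g_{2,\hat\beta^{(0)}}(Y_i,X_i) K_h(x - X_i),$$

$$\hat D(x) = n^{-1} \sum_{i=1}^n g_{1,\hat\beta^{(0)}}(Y_i,X_i) K_h(x - X_i),$$

with $K_h(\cdot) = h^{-q} K(\cdot/h)$. We require the following assumptions.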

(c1) The first step estimator is consistent, i.e., $\hat\beta^{(0)} \to \beta_0$ in $P_\infty$-probability.

(c2) The variable $X_1$ has a density $f$ that is supported on a bounded convex set with nonempty interior $Q \subset \mathbb{R}^q$ and there exists $b > 0$ such that $\inf_{x \in Q} D_0(x) \ge b > 0$.

(c3) There exist $0 < \alpha_2 \le 1$, $M_1 > 0$ and $B_0$ an open ball centred at $\beta_0$, such that for any $x \in \mathbb{R}^q$, the maps $\beta \mapsto N_\beta(x)$ and $\beta \mapsto D_\beta(x)$ belong to $C_{\alpha_2,M_1}(B_0)$. Moreover, the map $x \mapsto D_0(x)$ is uniformly continuous on $Q$.

(c4) The kernel $K : \mathbb{R}^q \to \mathbb{R}$ is a symmetric ($K(u) = K(-u)$), bounded, compactly supported function such that

$$\int K(u)\,du = 1, \qquad \int 1_{\{x + hu \in Q\}} K(u)\,du \ge c > 0,$$

whenever $x \in Q$ and $0 < h < h_0$. Moreover, there exists $k_1 \in \mathbb{N}$ such that for each $|l| \le k_1 + 1$, the class

$$\left\{ K^{(l)}\left(\frac{x - \cdot}{h}\right) : h > 0,\ x \in \mathbb{R}^q \right\}$$ is a bounded measurable VC class

(we use the terminology of Giné and Guillou (2002), including the measurability requirements).

(c5) There exists $0 < \alpha_1 \le 1$ such that, as $n$ increases,

$$h \to 0, \qquad \frac{|\log(h)|}{n h^{q + 2(k_1 + \alpha_1)}} \to 0.$$

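Let $V(I)$ denote the set of all the functions that take their values in the set $I$. Define

$$\mathcal{F}_n = \left\{ \frac{N}{D} : N \in \mathcal{A}_{1,n},\ D \in \mathcal{A}_{2,n} \right\},$$

where

$$\mathcal{A}_{1,n} = \{C_{k_1+\alpha_1,M_1}(Q) + \mathcal{E}_{N,n}\} \cap V[-M_2, M_2],$$

$$\mathcal{A}_{2,n} = \{C_{k_1+\alpha_1,M_1}(Q) + \mathcal{E}_{D,n}\} \cap V[cb/2, M_2],$$

$$\mathcal{E}_{N,n} = \left\{ x \mapsto \int N_\beta(x - hu) K(u)\,du : \beta \in B_0 \right\},$$

$$\mathcal{E}_{D,n} = \left\{ x \mapsto \int D_\beta(x - hu) K(u)\,du : \beta \in B_0 \right\},$$

and $M_2 = 2 M_1 \int |K(u)|\,du$.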

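Theorem 6. If (c1) to (c5) hold, we have

$$P_\infty(\hat w \in \mathcal{F}_n) \to 1,$$

$$\log N_{[\,]}(\epsilon, \mathcal{F}_n, \|\cdot\|_\infty) \le \mathrm{const.}\ \epsilon^{-q/(k_1+\alpha_1)}, \quad \text{for any } \epsilon > 0,$$

where const. depends on $q$, $Q$, $k_1 + \alpha_1$, $M_1$ and $K$.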

Proof. By (c1), we have that $\hat\beta^{(0)} \in B_0$ with probability going to 1. Introducing

$$\Delta_N(x) = \hat N(x) - E\hat N(x),$$

$$\Delta_D(x) = \hat D(x) - E\hat D(x),$$

we consider the following three steps.

(i) $P_\infty(\Delta_N \in C_{k_1+\alpha_1,M_1}(Q)) \to 1$ and $P_\infty(\Delta_D \in C_{k_1+\alpha_1,M_1}(Q)) \to 1$,

(ii) $P_\infty(\hat N \in \mathcal{A}_{1,n}) \to 1$ and $P_\infty(\hat D \in \mathcal{A}_{2,n}) \to 1$ (note that this is the first claim of the theorem),

(iii) we compute the bound on the bracketing numbers of $\mathcal{F}_n$.

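Proof of (i). We make the proof for $\Delta_N$ since the treatment of $\Delta_D$ is similar. Given $l = (l_1,\ldots,l_q)$ such that $|l| \le k_1 + 1$, we have that $(E\hat N)^{(l)}(x) = h^{-(q+|l|)} E[g_{2,\hat\beta^{(0)}}(Y_1,X_1) K^{(l)}(h^{-1}(x - X_1))]$, for any $h > 0$. Hence

$$\Delta_N^{(l)}(x) = \frac{1}{n h^{|l|}} \sum_{i=1}^n \left[ g_{2,\hat\beta^{(0)}}(Y_i,X_i) K_h^{(l)}(x - X_i) - E\left( g_{2,\hat\beta^{(0)}}(Y_1,X_1) K_h^{(l)}(x - X_1) \right) \right],$$

where $K_h^{(l)}(\cdot) = h^{-q} K^{(l)}(\cdot/h)$, and we can apply Lemma 7 to get that

$$\sup_{x \in Q} |\Delta_N^{(l)}(x)| = O_{P_\infty}\left( \sqrt{\frac{|\log h|}{n h^{q + 2|l|}}} \right).$$

Then for $|l| \le k_1$, we know that $\Delta_N^{(l)}$ goes to 0 uniformly over $Q$, making the derivatives of $\Delta_N$ (with order smaller than or equal to $k_1$) bounded by $M_1$ with probability going to 1. Now we consider the Hölder property for $\Delta_N^{(l)}$ when $|l| = k_1$. For any $|x - x'| \le h$, by the mean value theorem, we have that

$$|\Delta_N^{(l)}(x) - \Delta_N^{(l)}(x')|\,|x - x'|^{-\alpha_1} \le \sup_{z \in Q} |\nabla_z \Delta_N^{(l)}(z)|\,|x - x'|^{1-\alpha_1} \le \sup_{z \in Q} |\nabla_z \Delta_N^{(l)}(z)|\, h^{1-\alpha_1},$$

which is, by virtue of Lemma 7, equal to a $O_{P_\infty}\left( \sqrt{\frac{|\log h|}{n h^{q + 2(k_1+\alpha_1)}}} \right) = o_{P_\infty}(1)$. When $|x - x'| > h$, we have

$$|\Delta_N^{(l)}(x) - \Delta_N^{(l)}(x')|\,|x - x'|^{-\alpha_1} \le 2 \sup_{z \in Q} |\Delta_N^{(l)}(z)|\, h^{-\alpha_1},$$

which has the same order as the previous term. As a consequence we have shown that

$$\sup_{x \ne x'} |\Delta_N^{(l)}(x) - \Delta_N^{(l)}(x')|\,|x - x'|^{-\alpha_1} = o_{P_\infty}(1),$$

implying that $\Delta_N^{(l)}$ is $\alpha_1$-Hölder (with constant $M_1$) with probability going to 1.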

Proof of (ii). For the first statement, using (i), it suffices to show that $\hat N$ lies in $V[-M_2, M_2]$ with probability going to 1. Lemma 7 and Condition (c3) yield

$$|\hat N(x)| \le |E\hat N(x)| + \sup_{x \in Q} |\Delta_N(x)|$$
$$= \left| \int N_{\hat\beta^{(0)}}(x - hu) K(u)\,du \right| + o_{P_\infty}(1)$$
$$\le M_1 \int |K(u)|\,du + o_{P_\infty}(1)$$
$$\le 2 M_1 \int |K(u)|\,du \quad \text{with probability going to 1.}$$

For the second statement, it suffices to show that $\hat D$ lies in $V[cb/2, M_2]$ with probability going to 1. To obtain the upper bound for this class we mimic what has been done before for $E\hat N$. To obtain the lower bound, first write

$$E\hat D(x) - (D_0 * K_h)(x) = \int (D_{\hat\beta^{(0)}}(x - hu) - D_{\beta_0}(x - hu)) K(u)\,du;$$

by Condition (c3), it goes to 0 uniformly over $x \in Q$. Then this yields

$$\hat D(x) = (D_0 * K_h)(x) + E\hat D(x) - (D_0 * K_h)(x) + \Delta_D(x)$$
$$\ge (D_0 * K_h)(x) - \sup_{x \in Q} |(D_0 * K_h)(x) - E\hat D(x)| - \sup_{x \in Q} |\Delta_D(x)|$$
$$= (D_0 * K_h)(x) - o_{P_\infty}(1).$$

Define $b(x,h) = \inf_{y \in Q,\ |y - x| \le hA} D_0(y)$ and $M(x,h) = \sup_{y \in Q,\ |y - x| \le hA} D_0(y)$, where $A$ is such that $K(u) = 0$ whenever $|u| > A$ ($A$ is finite because $K$ is compactly supported), and note that, by the uniform continuity of $D_0$, $\sup_{x \in Q} |M(x,h) - b(x,h)| \to 0$ as $h \to 0$. It follows that

$$(D_0 * K_h)(x) = \int D_0(x + hu) K(u)\,du$$
$$\ge b(x,h) \int 1_{\{x+hu \in Q\}} \{K(u)\}_+\,du + M(x,h) \int 1_{\{x+hu \in Q\}} \{K(u)\}_-\,du$$
$$= b(x,h) \int 1_{\{x+hu \in Q\}} K(u)\,du + (M(x,h) - b(x,h)) \int 1_{\{x+hu \in Q\}} \{K(u)\}_-\,du$$
$$\ge b(x,h) \int 1_{\{x+hu \in Q\}} K(u)\,du - o(1)$$
$$\ge b \int 1_{\{x+hu \in Q\}} K(u)\,du - o(1),$$

which is greater than $cb/2 > 0$ whenever $n$ is large enough.

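Proof of (iii). The bound on the bracketing numbers of the associated class is obtained as follows. First, Corollary 2.7.2, page 157, in van der Vaart and Wellner (1996) states that

$$\log N_{[\,]}(\epsilon, C_{k_1+\alpha_1,M_1}(Q), \|\cdot\|_\infty) \le \mathrm{const.}\ \epsilon^{-q/(k_1+\alpha_1)},$$

where const. depends only on $Q$, $k_1 + \alpha_1$ and $M_1$. Second, by Assumption (c3),

$$\left| \int (N_\beta(x - hu) - N_{\beta'}(x - hu)) K(u)\,du \right| \le |\beta - \beta'|^{\alpha_2} M_1 \int |K(u)|\,du,$$

$$\left| \int (D_\beta(x - hu) - D_{\beta'}(x - hu)) K(u)\,du \right| \le |\beta - \beta'|^{\alpha_2} M_1 \int |K(u)|\,du,$$

hence the classes $\mathcal{E}_{N,n}$ and $\mathcal{E}_{D,n}$ have $\|\cdot\|_\infty$-bracketing numbers that are smaller than the $|\cdot|$-covering numbers of $B_0$ (up to some multiplicative constant), which are obviously smaller than the $\|\cdot\|_\infty$-bracketing numbers of $C_{k_1+\alpha_1,M_1}(Q)$. Since the class $\mathcal{A}_{1,n}$ (resp. $\mathcal{A}_{2,n}$) coincides with the set $C_{k_1+\alpha_1,M_1}(Q)$ plus $\mathcal{E}_{N,n}$ (resp. $\mathcal{E}_{D,n}$), the bracketing numbers of $\mathcal{A}_{1,n}$ (resp. $\mathcal{A}_{2,n}$) are then smaller than the square of the bracketing number of $C_{k_1+\alpha_1,M_1}(Q)$, i.e.,

$$\log N_{[\,]}(\epsilon, \mathcal{A}_{j,n}, \|\cdot\|_\infty) \le \mathrm{const.}\ \epsilon^{-q/(k_1+\alpha_1)},$$

for $j = 1,2$. Let $[\underline N_1, \overline N_1], \ldots, [\underline N_{n_1}, \overline N_{n_1}]$ (resp. $[\underline D_1, \overline D_1], \ldots, [\underline D_{n_2}, \overline D_{n_2}]$) be $\|\cdot\|_\infty$-brackets of size $\epsilon$ that cover $\mathcal{A}_{1,n}$ (resp. $\mathcal{A}_{2,n}$). Clearly, by taking $\underline D_1 \vee b, \ldots, \underline D_{n_2} \vee b$ in place of $\underline D_1, \ldots, \underline D_{n_2}$, we can assume that the elements of the brackets of $\mathcal{A}_{2,n}$ are larger than or equal to $b$. By a similar argument, all the brackets of $\mathcal{A}_{1,n}$ are bounded by $M_2$. Then for any $N \in \mathcal{A}_{1,n}$ and $D \in \mathcal{A}_{2,n}$, there exist $1 \le i \le n_1$ and $1 \le j \le n_2$ such that

$$\frac{\underline N_i}{\overline D_j} \le \frac{N}{D} \le \frac{\overline N_i}{\underline D_j}, \qquad \left\| \frac{\overline N_i}{\underline D_j} - \frac{\underline N_i}{\overline D_j} \right\|_\infty \le \mathrm{const.}\ \epsilon,$$

where const. is a constant that depends only on $b$ and $M_2$. As a consequence we have exhibited a $\|\cdot\|_\infty$-bracketing of size $\mathrm{const.}\,\epsilon$ with $n_1 n_2$ elements, yielding the statement of the theorem.

□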

Remark 9. On the one hand, no strong assumptions are imposed on the regularity of the targeted functions $N_0$ and $D_0$. Actually, we only require the uniform continuity of $D_0$ to hold. On the other hand, the kernel needs to be many times differentiable. Hence the procedure consists of approximating a function, not necessarily regular, by a smooth function. In this way, we can control the entropy generated by the class of estimated functions.

Remark 10. Another way to proceed is to consider the classes

$$\mathcal{E}_N = \left\{ x \mapsto \int N_\beta(x - hu) K(u)\,du : \beta \in B_0,\ h > 0 \right\},$$

$$\mathcal{E}_D = \left\{ x \mapsto \int D_\beta(x - hu) K(u)\,du : \beta \in B_0,\ h > 0 \right\},$$

in place of $\mathcal{E}_{N,n}$ and $\mathcal{E}_{D,n}$. The resulting classes are larger but they no longer depend on $n$. To calculate the bracketing entropy of the spaces $\mathcal{E}_N$ and $\mathcal{E}_D$, one might consider the $L_r(P)$-metric rather than the uniform metric because the latter involves some difficulties at the boundary points.

Remark 11. The assumptions on the kernel might seem awkward at first. Examples of kernels that satisfy the last requirement in (c4) are given in Nolan and Pollard (1987) (Lemma 22), see also Giné and Guillou (2002). An interesting fact is that $\{K(\frac{x - \cdot}{h}) : h > 0,\ x \in \mathbb{R}\}$ is a uniformly bounded VC class of functions whenever $K$ has bounded variation. The assumption that $\int 1_{\{x+hu \in Q\}} K(u)\,du \ge c > 0$ holds true if $Q$ is a smooth surface, i.e., the distance between $x \in \mathbb{R}^q$ and $Q$ is a differentiable function of $x$. Note also that in the one-dimensional case, it is always verified. Moreover, this condition makes it possible to include non-smooth surfaces such as cubes.
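
The one-dimensional case of the boundary condition in (c4) is easy to check numerically. The sketch below is an illustration under assumed choices: $Q = [0,1]$, the one-dimensional Epanechnikov kernel, and a plain Riemann sum; the function names are hypothetical. At an endpoint of the interval a symmetric kernel keeps half of its mass inside $Q$ once $h$ is small, so $c = 1/2$ works in this setting.

```python
import numpy as np

def epanechnikov_1d(u):
    """One-dimensional Epanechnikov kernel 0.75 * (1 - u^2)_+, which integrates to 1."""
    return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

def boundary_mass(x, h, a=0.0, b=1.0, grid=20001):
    """Riemann-sum approximation of the integral of K(u) over {u : x + h*u in [a, b]}."""
    u = np.linspace(-1.0, 1.0, grid)          # the kernel vanishes outside [-1, 1]
    inside = (x + h * u >= a) & (x + h * u <= b)
    du = u[1] - u[0]
    return float(np.sum(epanechnikov_1d(u) * inside) * du)

mass_at_edge = boundary_mass(0.0, 0.1)        # about 1/2: half of the kernel falls outside Q
mass_in_middle = boundary_mass(0.5, 0.1)      # about 1: the whole kernel support lies in Q
```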


4.4 Semiparametric estimation of $w_0$

The nonparametric approach involves a smoothing in the space $\mathbb{R}^q$. It is well known that the smaller the dimension $q$, the better the estimation. Although the dimension does not affect the efficiency of the estimators (any rate of convergence of $\hat w$ to $w_0$ satisfies condition (b5')), it certainly influences the small sample size performance.

There exist different ways to introduce a semiparametric procedure to estimate $w_0$. We rely on the single index model. In our initial regression model (16), the conditional mean of $Y$ given $X$ depends only on $\beta_{02}^T X$. Given this, it is slightly stronger to ask that the conditional law of $Y$ given $X$ is equal to the conditional law of $Y$ given $\beta_{02}^T X$, in other words, that

$$Y \perp\!\!\!\perp X \mid \beta_{02}^T X,$$

or equivalently that

$$E(g(Y)|X) = E(g(Y)|\beta_{02}^T X), \tag{18}$$

for every bounded measurable function $g$. Such an assumption has been introduced in Li (1991) to estimate the law of $Y$ given $X$. Here (18) is introduced in a different spirit than in Li (1991): since we have already assumed a parametric regression model in (16), condition (18) is an additional mild requirement that serves only the estimation of $w_0$. The calculation of semiparametric estimators of $w_0$ is done by using similar tools as in the previous section. In order to fully benefit from Condition (18), we realize the smoothing in a low-dimensional subspace of $\mathbb{R}^q$. We define the estimator by




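$$\hat w(x) = \frac{\hat N_{\hat\beta^{(0)}}(\hat\beta_2^{(0)T} x)}{\hat D_{\hat\beta^{(0)}}(\hat\beta_2^{(0)T} x)},$$

where

$$\hat N_\beta(t) = n^{-1} \sum_{i=1}^n g_{2,\beta}(Y_i,X_i) L_h(t - \beta_2^T X_i),$$

$$\hat D_\beta(t) = n^{-1} \sum_{i=1}^n g_{1,\beta}(Y_i,X_i) L_h(t - \beta_2^T X_i),$$

with $L_h(\cdot) = h^{-1} L(\cdot/h)$. The proofs are more involved than in the nonparametric case because of the randomness of the space generated by $\hat\beta_2^{(0)}$ in which the smoothing is realized.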


5 Simulations 


The asymptotic analysis conducted in the previous sections demonstrates that the way $w_0$ is estimated does not matter provided the estimator is consistent, e.g., $\hat w_1$ and $\hat w_2$ can converge to $w_0$ with different rates, while $\hat\beta(\hat w_1)$ and $\hat\beta(\hat w_2)$ are asymptotically equivalent. Nevertheless, when the sample size is not very large, differences might arise between the procedures. In what follows we consider the three approaches investigated in the previous section, namely parametric, nonparametric and semiparametric. Each of these procedures results in different rates of convergence of $\hat w$ to $w_0$. Here our purpose is twofold: first, to provide a clear picture of the small sample size performance of each method; second, from a practical point of view, to analyse the relaxation of the smoothness properties of $w_0$.

We consider the following heteroscedastic linear regression model. Let the sequence $(Y_i, X_i)_{i=1,\ldots,n}$ be independent and identically distributed such that

$$Y_i = \beta_{01} + \beta_{02}^T X_i + \sigma_0(X_i) e_i,$$

where $(X_i, e_i) \in \mathbb{R}^{q+1}$ has a standard normal distribution and $(\beta_{01}, \beta_{02}) = (1,\ldots,1)/\sqrt{q+1}$. The weighted least-squares estimator is given by

$$\hat\beta(w) = \hat\Sigma(w)^{-1} \hat\gamma(w), \tag{19}$$

with $\hat\Sigma(w) = n^{-1} \sum_{i=1}^n \tilde X_i \tilde X_i^T w(X_i)$, $\hat\gamma(w) = n^{-1} \sum_{i=1}^n \tilde X_i Y_i w(X_i)$ and $\tilde X_i = (1, X_i^T)^T$, for $i = 1,\ldots,n$. Let $\hat\beta^{(0)} = (\hat\beta_1^{(0)}, \hat\beta_2^{(0)})$, with $\hat\beta_1^{(0)} \in \mathbb{R}$ and $\hat\beta_2^{(0)} \in \mathbb{R}^q$, denote the first step estimator with constant weights. For different sample sizes $n$ and also several dimensions $q$, we consider two values for $\sigma_0$: one smooth function and one discontinuous function, given by, respectively,


$$\sigma_{01}(x) = |\beta_{02}^T x| \qquad \text{and} \qquad \sigma_{02}(x) = \tfrac{1}{2} + 2 \cdot 1_{\{\beta_{02}^T x > 0\}}.$$

In each case, the optimal weight function $w_0 = 1/\sigma_0^2$ is estimated by these methods:


(i) parametric, $\hat w$ is computed using $\hat\beta^{(0)}$ in place of $\beta_0$ in the formula of $w_0$,

(ii) nonparametric, applying a kernel smoothing procedure (of the Nadaraya-Watson type), $\hat w$ is given by

$$\hat w(x) = \frac{\sum_{i=1}^n K_h(X_i - x)}{\sum_{i=1}^n (Y_i - \hat\beta_1^{(0)} - \hat\beta_2^{(0)T} X_i)^2 K_h(X_i - x)},$$

(iii) semiparametric, applying a kernel smoothing procedure in a reduced sample-based space, $\hat w$ is given by

$$\hat w(x) = \frac{\sum_{i=1}^n K_h(\hat A(X_i - x))}{\sum_{i=1}^n (Y_i - \hat\beta_1^{(0)} - \hat\beta_2^{(0)T} X_i)^2 K_h(\hat A(X_i - x))},$$

where $\hat A = \hat P_2^{(0)} + \epsilon I$ and $\hat P_2^{(0)}$ denotes the orthogonal projector onto the linear space generated by $\hat\beta_2^{(0)}$.

For (ii) and (iii), the kernel $K$ is the Epanechnikov kernel given by $K(u) = c_q (1 - |u|^2)_+$, where $c_q$ is such that $\int K(u)\,du = 1$. Whenever $\hat w$ is computed according to one of the methods (i), (ii) or (iii), the final estimator of $\beta_0$ is computed with $\hat\beta(\hat w)$ given by (19).
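
The two-step procedure with the nonparametric weights (ii) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the simulations: the function names, the bandwidth `h=1.0`, the numerical `floor` on the denominator and the particular discontinuous variance function are hypothetical choices made only so that the example runs.

```python
import numpy as np

def epanechnikov(U):
    """Epanechnikov profile (1 - |u|^2)_+ applied to the rows of U.

    The constant c_q and the factor h^{-q} cancel in the ratio defining the weights,
    so they are omitted here.
    """
    return np.clip(1.0 - np.sum(U ** 2, axis=-1), 0.0, None)

def wls(X, Y, weights):
    """Weighted least squares with intercept: solves Sigma(w) beta = gamma(w) as in (19)."""
    Xt = np.column_stack([np.ones(len(X)), X])
    Sigma = (Xt * weights[:, None]).T @ Xt / len(X)
    gamma = (Xt * weights[:, None]).T @ Y / len(X)
    return np.linalg.solve(Sigma, gamma)

def nw_weights(X, Y, beta_first, h, floor=1e-8):
    """Method (ii) at the sample points: sum_i K_h(X_i - X_j) / sum_i r_i^2 K_h(X_i - X_j)."""
    r2 = (Y - beta_first[0] - X @ beta_first[1:]) ** 2       # squared first-step residuals
    K = epanechnikov((X[:, None, :] - X[None, :, :]) / h)    # K[i, j] for the pair (X_i, X_j)
    return K.sum(axis=0) / np.maximum(r2 @ K, floor)         # floor is an assumed guard

# Illustrative data from a heteroscedastic linear model (assumed variance function).
rng = np.random.default_rng(0)
n, q = 500, 2
beta_true = np.ones(q + 1) / np.sqrt(q + 1)
X = rng.normal(size=(n, q))
sigma = 0.5 + 2.0 * (X @ beta_true[1:] > 0)
Y = beta_true[0] + X @ beta_true[1:] + sigma * rng.normal(size=n)

beta_first = wls(X, Y, np.ones(n))              # first step: constant weights
w_hat = nw_weights(X, Y, beta_first, h=1.0)     # estimated weights at X_1, ..., X_n
beta_final = wls(X, Y, w_hat)                   # final estimator
```

Only the values of the weight function at the observed covariates enter (19), which is why the sketch evaluates the weights at the sample points rather than on a grid.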

As highlighted in the previous section, the bandwidth $h$ needs to be chosen in such a way that $\hat w$ is smooth. In practice, we find that choosing $h$ by cross validation is reasonable. More precisely, we consider the estimation of $\sigma_0^2(x)$ by $\hat\sigma^2(x)$, i.e.,

$$h_{cv} = \operatorname{argmin}_{h > 0} \sum_{i=1}^n \left( (Y_i - \hat\beta_1^{(0)} - \hat\beta_2^{(0)T} X_i)^2 - \hat\sigma^{2(-i)}(X_i) \right)^2,$$

where $\hat\sigma^{2(-i)}(x)$ is either the leave-one-out nonparametric estimator of $\sigma_0^2(x)$ given by (ii) or the leave-one-out semiparametric estimator of $\sigma_0^2(x)$ given by (iii). Such a choice of $h$ has the advantage of selecting the bandwidth automatically, without having to take care of the respective dimensionality of the semiparametric and nonparametric procedures. In every example, the semiparametric $h_{cv}$ was smaller than the nonparametric $h_{cv}$.
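
The cross-validation criterion can be sketched as follows for the nonparametric case. The code is an illustration under assumptions: the minimisation over $h > 0$ is replaced by a search over a finite, user-supplied grid, bandwidths for which some observation has no neighbour are treated as inadmissible, and the function names and the stand-in squared residuals are hypothetical.

```python
import numpy as np

def loo_variance(X, r2, h):
    """Leave-one-out Nadaraya-Watson estimate of E(r^2 | X = X_j), evaluated at each X_j."""
    U = (X[:, None, :] - X[None, :, :]) / h
    K = np.clip(1.0 - np.sum(U ** 2, axis=-1), 0.0, None)   # Epanechnikov profile
    np.fill_diagonal(K, 0.0)                                 # drop observation j at X_j
    den = K.sum(axis=0)
    num = r2 @ K
    return np.where(den > 0, num / np.where(den > 0, den, 1.0), np.nan)

def cv_bandwidth(X, r2, grid):
    """Grid search for the bandwidth minimising the leave-one-out squared error."""
    scores = []
    for h in grid:
        s2 = loo_variance(X, r2, h)
        # A NaN means some point has no neighbour within h: treat h as inadmissible.
        scores.append(np.inf if np.isnan(s2).any() else float(np.sum((r2 - s2) ** 2)))
    scores = np.array(scores)
    return grid[int(np.argmin(scores))], scores

# Stand-in squared residuals with a discontinuous conditional variance (assumed example).
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 1))
r2 = ((0.5 + 2.0 * (X[:, 0] > 0)) * rng.normal(size=n)) ** 2
grid = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
h_cv, scores = cv_bandwidth(X, r2, grid)
```

For the semiparametric variant, the same criterion would be evaluated with the differences $X_i - X_j$ replaced by their images under the matrix $\hat A$.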

For the semiparametric method, the matrix $\hat A$ denotes the orthogonal projector onto the space generated by $\hat\beta_2^{(0)}$, perturbed by $\epsilon$ on the diagonal. This avoids placing blind confidence in the first-step estimator $\hat\beta_2^{(0)}$ by accounting for variations of $w_0$ in the other directions. Hence $\epsilon$ is reasonably selected if $\epsilon I$ has the same order as the error $\hat P_2^{(0)} - P_2$, where $P_2$ is the orthogonal projector onto the linear space generated by $\beta_{02}$. Computed with the Frobenius norm $|\cdot|_F$, it yields

$$|\hat P_2^{(0)} - P_2|_F^2 = 2\,\mathrm{trace}((I - \hat P_2^{(0)}) P_2) = \frac{2\,|(I - \hat P_2^{(0)}) \beta_{02}|^2}{|\beta_{02}|^2}.$$

The numerator is approximated by an estimator of the average value of its asymptotic law (in 
the case where e X X), it gives where Xk are the eigenvalues associated to 

the matrix (/ — P 2 °^)S^^(/ — = n~^ “ /^o ~ l^o Xi)^ and denotes the qx q 

lower triangular block of the inverse of n~^ SILi XiXf ■ Xs a consequence e is given by 


e = 


\ ng |^|2 


where y/q appears as a normalizing constant. 
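
The effect of the perturbed projector on the kernel argument can be seen in a short sketch. It is only an illustration: the function name is hypothetical and `eps` is passed in by hand, since the data-driven choice of $\epsilon$ described above is not reproduced here.

```python
import numpy as np

def perturbed_projector(beta2, eps):
    """Return P + eps * I, where P projects orthogonally onto the line spanned by beta2."""
    b = np.asarray(beta2, dtype=float)
    P = np.outer(b, b) / (b @ b)
    return P + eps * np.eye(len(b))

b = np.array([1.0, 1.0, 0.0])
A = perturbed_projector(b, eps=0.1)       # eps = 0.1 is an arbitrary illustrative value
along = A @ b                              # direction of beta2: scaled by 1 + eps
across = A @ np.array([1.0, -1.0, 0.0])    # orthogonal direction: shrunk by eps
```

Differences $X_i - x$ along the index direction are therefore almost unchanged, while differences in orthogonal directions are shrunk by the factor $\epsilon$, so the kernel smooths mostly along the estimated index while still reacting slightly to the other directions.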
Figure 1: Each boxplot is based on 500 estimates of $|\hat\beta - \beta_0|^2$ when $q = 4$ and $\sigma_0(x) = \sigma_{01}(x)$. The parametric, nonparametric (np) and semiparametric (sp) approaches are respectively based on (i), (ii) and (iii).

Figure 2: Each boxplot is based on 500 estimates of $|\hat\beta - \beta_0|^2$ when $q = 16$ and $\sigma_0(x) = \sigma_{01}(x)$. The parametric, nonparametric (np) and semiparametric (sp) approaches are respectively based on (i), (ii) and (iii).

Figure 3: Each boxplot is based on 500 estimates of $|\hat\beta - \beta_0|^2$ when $q = 4$ and $\sigma_0(x) = \sigma_{02}(x)$. The parametric, nonparametric (np) and semiparametric (sp) approaches are respectively based on (i), (ii) and (iii).
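
Figure 4: Each boxplot is based on 500 estimates of $|\hat\beta - \beta_0|^2$ when $q = 16$ and $\sigma_0(x) = \sigma_{02}(x)$. The parametric, nonparametric (np) and semiparametric (sp) approaches are respectively based on (i), (ii) and (iii).
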
Figures 1 to 4 give boxplots of the estimation error of each method, parametric (i), nonparametric (ii) and semiparametric (iii), for different values of $n = 50, 100, 500$, $q = 4, 16$ and $\sigma_0 = \sigma_{01}, \sigma_{02}$. We also consider the first-step estimator $\hat\beta^{(0)}$ and a "reference estimator" computed with the unknown optimal weights, i.e., $\hat\beta(w_0)$. In every case, the accuracy of each method lies between that of the first-step estimator and that of the reference estimator. In agreement with Theorems 5 and 6, the gap between the reference estimator and the methods (i), (ii), (iii) diminishes as $n$ increases. The methods (i), (ii), (iii) perform differently, showing that their equivalence occurs only at very large sample sizes.

Among the three methods under evaluation (i), (ii), (iii), the clear winner is the semiparametric method (with selection of the bandwidth by cross validation). The fact that it outperforms the nonparametric estimator was somewhat predictable, but the difference in accuracy with the parametric method is surprising. In every situation, the variance and the mean of the error associated with the semiparametric approach are smaller than the variance and the mean of the others. Moreover, the nonparametric method performs as well as the parametric method even in high-dimensional settings. In fact, both approaches are similarly affected by the increase of the dimension. Finally, one sees that the choice of the bandwidth by cross validation works well for both the nonparametric and the semiparametric methods. In all cases, the estimator with $h_{cv}$ performs similarly to the estimator with the optimal $h$.


6 Appendix: concentration rates for kernel regression estimators


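The result follows from the formulation of Talagrand's inequality (Talagrand, 1994) given in Theorem 2.1 in Giné and Guillou (2002).

Lemma 7. Let $(Y_i \in \mathbb{R}, X_i \in \mathbb{R}^q,\ i = 1,\ldots,n)$ denote a sequence of independent and identically distributed random variables such that $X_1$ has a bounded density $f$. Given $K : \mathbb{R}^q \to \mathbb{R}$ such that $\int K(u)^2\,du < +\infty$ and $\Psi$ a class of real functions defined on $\mathbb{R}^{q+1}$, if both classes $\Psi$ and $\{K(\frac{x - \cdot}{h}),\ x \in \mathbb{R}^q,\ h > 0\}$ are bounded measurable VC classes, it holds that

$$\sup_{\psi \in \Psi,\ x \in Q} \left| \frac{1}{n} \sum_{i=1}^n \left\{ \psi(Y_i,X_i) K_h(x - X_i) - E[\psi(Y_1,X_1) K_h(x - X_1)] \right\} \right| = O_{P_\infty}\left( \sqrt{\frac{|\log h|}{n h^q}} \right).$$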
Proof. The empirical process to consider is indexed by the product class $\Psi\mathcal{K}_h$, where $\mathcal{K}_h = \{K(\frac{x - \cdot}{h}),\ x \in \mathbb{R}^q\}$, which is uniformly bounded VC since the product of two uniformly bounded VC classes remains uniformly bounded VC. The variance satisfies

$$\mathrm{var}(\psi(Y_1,X_1) K(h^{-1}(x - X_1))) \le E\,\psi(Y_1,X_1)^2 K(h^{-1}(x - X_1))^2 \le \|\psi\|_\infty^2 \|f\|_\infty h^q \int K(u)^2\,du,$$

and a uniform bound is given by $\|\psi K\|_\infty \le \sup_{\psi \in \Psi} \|\psi\|_\infty \|K\|_\infty$. The application of Theorem 2.1 in Giné and Guillou (2002) gives the specified bound. □